多跳问题回答(QA)需要对多个文档进行推理,以回答一个复杂的问题并提供可解释的支持证据。但是,提供支持证据不足以证明模型已经执行了所需的推理来达到正确的答案。大多数现有的多跳质量检查方法也无法回答大部分子问题,即使他们的父母问题得到了正确的回答。在本文中,我们为多跳QA提出了基于及时的保护学习(PCL)框架,该框架从多跳QA任务中获取了新知识,同时保留了在单跳QA任务上学习的旧知识,从而减轻了遗忘。具体来说,我们首先在现有的单跳质量检查任务上训练模型,然后冻结该模型,并通过为多跳质量检查任务分配其他子网络来扩展它。此外,为了调整预训练的语言模型以刺激特定多跳问题所需的推理类型,我们学习了新型子网络的软提示,以执行特定于类型的推理。 HOTPOTQA基准测试的实验结果表明,PCL具有多跳质量质量质量检查的竞争力,并且在相应的单跳子问题上保留了良好的性能,这表明PCL通过忘记通过忘记来减轻知识丧失的功效。
translated by 谷歌翻译
将深度学习与象征性逻辑推理相结合旨在利用这两个领域的成功,并引起越来越多的关注。受到深度循环的启发,这是一种端到端的模型,该模型训练了逻辑程序的推理,我们引入了Ima-Glove-GA,这是一种以自然语言表达的多步推理的迭代神经推理网络。在我们的模型中,推理是使用基于带门注意机制的RNN的迭代记忆神经网络进行的。我们在三个数据集上评估了iMa-glove-ga:副本,Conceptrules V1和Conceptrules V2。实验结果表明,与DeepLo​​gic和其他RNN基线模型相比,深沟和栅极注意可以达到更高的测试精度。当规则被淘汰时,我们的模型比罗伯塔·洛尔格(Roberta-Large)实现了更好的分布概括。此外,为了解决当前多步推理数据集中推理深度分布不平衡分布的问题,我们开发了Pararule-Plus,这是一个大型数据集,其中包含更多需要更深入推理步骤的示例。实验结果表明,添加Pararule-Plus可以在需要更深层次深度的示例中提高模型的性能。源代码和数据可在https://github.com/strong-ai-lab/multi-step-deductive-reasoning-over-natural语言中获得。
translated by 谷歌翻译
有效的多跳问答(QA)需要在多个分散的段落上进行推理,并提供答案的解释。大多数现有方法无法提供可解释的推理过程,以说明这些模型如何得出答案。在本文中,我们提出了一种基于多跳QA的抽象含义表示形式(QDAMR)的问题分解方法,该方法通过将多跳问题分解为更简单的子问题并按顺序回答它们来实现可解释的推理。由于注释分解很昂贵,因此我们首先将理解多跳问题的复杂性委托给AMR解析器。然后,我们通过基于所需的推理类型对相应的AMR图进行分割实现多跳问题的分解。最后,我们使用AMR到文本生成模型生成子问题,并使用现成的QA模型回答它们。 HOTPOTQA的实验结果表明,我们的方法在可解释的推理方面具有竞争力,并且QDAMR产生的子问题是良好的,表现优于现有的基于问题分解的多跳质量质量检查方法。
translated by 谷歌翻译
平均野外游戏(MFGS)提供了一个可在数学上拖动的框架,用于通过利用平均场理论来简化代理之间的相互作用来建模大规模多代理系统。它使应用逆增强学习(IRL)能够通过从展示的行为中恢复奖励信号来预测大人群的行为。但是,现有的MFG的IRL方法无能为力,无法确定各个代理的行为中的不确定性。本文提出了一个新颖的框架,平均场对抗IRL(MF-AIRL),该框架能够解决示范中的不确定性。我们在最大熵IRL和新的平衡概念上建立MF-AIRL。我们通过不完美的演示评估了对模拟任务的方法。实验结果证明了MF-AIRL比奖励恢复中现有方法的优越性。
translated by 谷歌翻译
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
translated by 谷歌翻译
Masked image modeling (MIM) has shown great promise for self-supervised learning (SSL) yet been criticized for learning inefficiency. We believe the insufficient utilization of training signals should be responsible. To alleviate this issue, we introduce a conceptually simple yet learning-efficient MIM training scheme, termed Disjoint Masking with Joint Distillation (DMJD). For disjoint masking (DM), we sequentially sample multiple masked views per image in a mini-batch with the disjoint regulation to raise the usage of tokens for reconstruction in each image while keeping the masking rate of each view. For joint distillation (JD), we adopt a dual branch architecture to respectively predict invisible (masked) and visible (unmasked) tokens with superior learning targets. Rooting in orthogonal perspectives for training efficiency improvement, DM and JD cooperatively accelerate the training convergence yet not sacrificing the model generalization ability. Concretely, DM can train ViT with half of the effective training epochs (3.7 times less time-consuming) to report competitive performance. With JD, our DMJD clearly improves the linear probing classification accuracy over ConvMAE by 5.8%. On fine-grained downstream tasks like semantic segmentation, object detection, etc., our DMJD also presents superior generalization compared with state-of-the-art SSL methods. The code and model will be made public at https://github.com/mx-mark/DMJD.
translated by 谷歌翻译
Recently, great progress has been made in single-image super-resolution (SISR) based on deep learning technology. However, the existing methods usually require a large computational cost. Meanwhile, the activation function will cause some features of the intermediate layer to be lost. Therefore, it is a challenge to make the model lightweight while reducing the impact of intermediate feature loss on the reconstruction quality. In this paper, we propose a Feature Interaction Weighted Hybrid Network (FIWHN) to alleviate the above problem. Specifically, FIWHN consists of a series of novel Wide-residual Distillation Interaction Blocks (WDIB) as the backbone, where every third WDIBs form a Feature shuffle Weighted Group (FSWG) by mutual information mixing and fusion. In addition, to mitigate the adverse effects of intermediate feature loss on the reconstruction results, we introduced a well-designed Wide Convolutional Residual Weighting (WCRW) and Wide Identical Residual Weighting (WIRW) units in WDIB, and effectively cross-fused features of different finenesses through a Wide-residual Distillation Connection (WRDC) framework and a Self-Calibrating Fusion (SCF) unit. Finally, to complement the global features lacking in the CNN model, we introduced the Transformer into our model and explored a new way of combining the CNN and Transformer. Extensive quantitative and qualitative experiments on low-level and high-level tasks show that our proposed FIWHN can achieve a good balance between performance and efficiency, and is more conducive to downstream tasks to solve problems in low-pixel scenarios.
translated by 谷歌翻译
Rigorous guarantees about the performance of predictive algorithms are necessary in order to ensure their responsible use. Previous work has largely focused on bounding the expected loss of a predictor, but this is not sufficient in many risk-sensitive applications where the distribution of errors is important. In this work, we propose a flexible framework to produce a family of bounds on quantiles of the loss distribution incurred by a predictor. Our method takes advantage of the order statistics of the observed loss values rather than relying on the sample mean alone. We show that a quantile is an informative way of quantifying predictive performance, and that our framework applies to a variety of quantile-based metrics, each targeting important subsets of the data distribution. We analyze the theoretical properties of our proposed method and demonstrate its ability to rigorously control loss quantiles on several real-world datasets.
translated by 谷歌翻译
Recently, large-scale pre-trained models have shown their advantages in many tasks. However, due to the huge computational complexity and storage requirements, it is challenging to apply the large-scale model to real scenes. A common solution is knowledge distillation which regards the large-scale model as a teacher model and helps to train a small student model to obtain a competitive performance. Cross-task Knowledge distillation expands the application scenarios of the large-scale pre-trained model. Existing knowledge distillation works focus on directly mimicking the final prediction or the intermediate layers of the teacher model, which represent the global-level characteristics and are task-specific. To alleviate the constraint of different label spaces, capturing invariant intrinsic local object characteristics (such as the shape characteristics of the leg and tail of the cattle and horse) plays a key role. Considering the complexity and variability of real scene tasks, we propose a Prototype-guided Cross-task Knowledge Distillation (ProC-KD) approach to transfer the intrinsic local-level object knowledge of a large-scale teacher network to various task scenarios. First, to better transfer the generalized knowledge in the teacher model in cross-task scenarios, we propose a prototype learning module to learn from the essential feature representation of objects in the teacher model. Secondly, for diverse downstream tasks, we propose a task-adaptive feature augmentation module to enhance the features of the student model with the learned generalization prototype features and guide the training of the student model to improve its generalization ability. The experimental results on various visual tasks demonstrate the effectiveness of our approach for large-scale model cross-task knowledge distillation scenes.
translated by 谷歌翻译
Crowd counting plays an important role in risk perception and early warning, traffic control and scene statistical analysis. The challenges of crowd counting in highly dense and complex scenes lie in the mutual occlusion of the human body parts, the large variation of the body scales and the complexity of imaging conditions. Deep learning based head detection is a promising method for crowd counting. However the highly concerned object detection networks cannot be well applied to this field for two main reasons. First, most of the existing head detection datasets are only annotated with the center points instead of bounding boxes which is mandatory for the canonical detectors. Second, the sample imbalance has not been overcome yet in highly dense and complex scenes because the existing loss functions calculate the positive loss at a single key point or in the entire target area with the same weight. To address these problems, We propose a novel loss function, called Mask Focal Loss, to unify the loss functions based on heatmap ground truth (GT) and binary feature map GT. Mask Focal Loss redefines the weight of the loss contributions according to the situ value of the heatmap with a Gaussian kernel. For better evaluation and comparison, a new synthetic dataset GTA\_Head is made public, including 35 sequences, 5096 images and 1732043 head labels with bounding boxes. Experimental results show the overwhelming performance and demonstrate that our proposed Mask Focal Loss is applicable to all of the canonical detectors and to various datasets with different GT. This provides a strong basis for surpassing the crowd counting methods based on density estimation.
translated by 谷歌翻译